* Neural Networks
  * a short introduction
* Independent Component Analysis
  * Stochastic signal processing
  * Constrained optimization in ICA
* Geometric Integration of Learning equations
  * gradient flows and algorithms on manifolds
  * MEC learning
  * Newton methods

Neural Networks

Goals:

* Achieve efficient use of machines in tasks currently solved by humans
* Improve computing capabilities looking at the brain as a model
* Understand how the brain works

Applications

* Machine Learning
  1. How can a computer learn from a set of examples?
  2. Constrained optimization
  3. Pattern recognition, classification
  4. Associative memory
* Cognitive science
  1. Models for high level reasoning: language, problem solving
  2. Models for low level reasoning: vision, speech recognition, speech generation
* Neurobiology: find models for how the brain works

List of fields where Neural Networks are used

* Signal processing
* Control
* Robotics (navigation, vision)
* Medicine
* Business and Finance
* Data Compression

The brain as an Information Processing System

* Massively parallel: 10 billion neurons, 10,000 synapses per neuron
* Slow hardware: neurons operate at about 100 Hz, while conventional CPUs execute several hundred million machine level operations per second

[Figure: the brain, with labels "sensorimotor area" and "auditory association (including Wernicke's area, in left hemisphere)"]

[Figure: A Motor Neuron]
Synapse:  transmission  of  a  signal  between  neurons  via  a 
neurotransmitter.  Learning  corresponds  to  alteration  of  the  strength  of 
the  connection  between  neurons. 


[Figure: A Synapse, showing the neurotransmitter and the reuptake of neurotransmitter]

A simple model for a neuron

Each node (neuron) receives signal inputs from n neighbor nodes:

\[ y_i = f\Big(\sum_j w_{ij}\, y_j\Big) \]

The weighted sum $\sum_j w_{ij} y_j$ is called the net input, $f$ is the activation function (if $f$ is the identity we have a linear unit), and $y_i$ is the output signal.
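
The neuron model above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output`, the sample weights and inputs, and the choice of `tanh` as a nonlinear activation are assumptions, not taken from the slides.

```python
import math

def neuron_output(weights, inputs, activation=lambda a: a):
    """Return f(sum_j w_ij * y_j) for a single node i."""
    net_input = sum(w * y for w, y in zip(weights, inputs))  # the net input
    return activation(net_input)

# Hypothetical weights w_ij and neighbor outputs y_j.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 0.5]

linear_out = neuron_output(weights, inputs)           # identity activation: a linear unit
tanh_out = neuron_output(weights, inputs, math.tanh)  # an assumed nonlinear activation
```

With the identity activation the output equals the net input, which is the linear unit used in the linear networks that follow.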

Linear Neural Networks

[Figure: a linear unit with several inputs and one output; source: http://www.willamette.edu/ gorr]

[Figure: a linear network with n inputs and p outputs]

Independent Component Analysis

The cocktail-party problem

Suppose you record two time signals $x_1(t)$ and $x_2(t)$ from two different positions in a room. Each recorded signal is a linear mixture of the voices of two speakers, which emit two sources $s_1(t)$ and $s_2(t)$:

\[ x_1(t) = a_{1,1} s_1(t) + a_{1,2} s_2(t) \]
\[ x_2(t) = a_{2,1} s_1(t) + a_{2,2} s_2(t) \]

Estimate $s_1(t)$ and $s_2(t)$ from the sole knowledge of $x_1(t)$ and $x_2(t)$.

Assume the sources and the recorded signals are samples of the zero-mean random variables $x_1, x_2$ (mixtures) and $s_1, s_2$ (independent components).

Assumption: $s_1(t)$ and $s_2(t)$ are statistically independent.

Unknown source signals: $s(t) = [s_1(t), \dots, s_n(t)]^T$

Given the output signals: $x(t) = A s(t)$, $x(t) = [x_1(t), \dots, x_p(t)]^T$

Unknown mixing matrix: $A$, of size $p \times n$

Find approximations $y$ of $s$ by constructing a de-mixing matrix $W$ and setting

\[ y = W x. \]
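
To make the mixing model concrete, here is a minimal NumPy sketch of $x = As$ and $y = Wx$ for two sources. The source signals and the entries of `A` are made up for illustration, and the de-mixing matrix is taken as the inverse of `A`, which is only possible here because the sketch knows `A`; in ICA the matrix $A$ is unknown and $W$ has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
t = np.linspace(0.0, 1.0, T)

# Two hypothetical sources: a sine wave and uniform noise.
s = np.vstack([np.sin(2 * np.pi * 5 * t), rng.uniform(-1.0, 1.0, T)])

A = np.array([[1.0, 0.5],
              [0.3, 1.0]])   # mixing matrix (assumed values; unknown in practice)
x = A @ s                    # recorded signals x(t) = A s(t)

W = np.linalg.inv(A)         # ideal de-mixing matrix, available only because A is known here
y = W @ x                    # y = W x
max_error = np.max(np.abs(y - s))
```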

Principles for reconstruction

The sum of two independent random variables usually has a distribution closer to Gaussian than the two original random variables (Central Limit Theorem).

Given $x = As$, find

\[ y = W x \approx s \]

by maximizing nongaussianity.

A measure of nongaussianity is kurtosis,

\[ \operatorname{kurt}(y) = E\{y^4\} - 3\,(E\{y^2\})^2, \]

and for $y$ of unit variance $\operatorname{kurt}(y) = E\{y^4\} - 3$.
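
The kurtosis formula above can be tried on samples. The sketch below is an illustration only: the sample distributions, the sample size and the helper name `kurt` are assumptions. It estimates the kurtosis of Gaussian, uniform and Laplace samples of unit variance, and of a normalized sum of two independent uniform variables, whose kurtosis is expected to lie closer to zero than that of a single uniform variable.

```python
import numpy as np

def kurt(y):
    """Sample version of kurt(y) = E{y^4} - 3 (E{y^2})^2 after removing the mean."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
N = 200_000
a = np.sqrt(3.0)                                    # uniform on [-a, a] has unit variance

gauss = rng.standard_normal(N)
uniform = rng.uniform(-a, a, N)
laplace = rng.laplace(0.0, 1.0 / np.sqrt(2.0), N)   # scale chosen for unit variance

k_gauss = kurt(gauss)       # close to 0
k_uniform = kurt(uniform)   # negative (sub-Gaussian)
k_laplace = kurt(laplace)   # positive (super-Gaussian)

# Normalized sum of two independent uniform variables: closer to Gaussian.
mix = (uniform + rng.uniform(-a, a, N)) / np.sqrt(2.0)
k_mix = kurt(mix)
```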

Whitening

Preprocessing of the output signals $x \to \tilde{x}$ such that the components of $\tilde{x}$ are uncorrelated with variances equal to 1:

\[ E\{\tilde{x}\tilde{x}^T\} = I. \]
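
Use for example the eigendecomposition $E\{x x^T\} = V D V^T$ and

\[ \tilde{x} = V D^{-1/2} V^T x \quad\Rightarrow\quad E\{\tilde{x}\tilde{x}^T\} = I, \]

and $\tilde{x} = V D^{-1/2} V^T A s = \tilde{A} s$; then

\[ E\{\tilde{x}\tilde{x}^T\} = \tilde{A}\, E\{s s^T\}\, \tilde{A}^T = \tilde{A}\tilde{A}^T = I. \]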
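
A minimal NumPy sketch of this whitening step, under stated assumptions: the sources are unit-variance uniform samples, the mixing matrix `A` is made up, and $E\{xx^T\}$ is replaced by its sample estimate. The names `whitener`, `x_tilde` and `A_tilde` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
a = np.sqrt(3.0)
s = rng.uniform(-a, a, size=(2, N))       # independent, unit-variance sources (assumed)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                # hypothetical mixing matrix
x = A @ s
x = x - x.mean(axis=1, keepdims=True)     # enforce zero mean on the samples

cov = (x @ x.T) / N                       # sample estimate of E{x x^T}
d, V = np.linalg.eigh(cov)                # cov = V diag(d) V^T
whitener = V @ np.diag(d ** -0.5) @ V.T   # V D^{-1/2} V^T
x_tilde = whitener @ x

cov_tilde = (x_tilde @ x_tilde.T) / N     # identity up to rounding
A_tilde = whitener @ A                    # approximately orthogonal
```

After whitening, the remaining unknown `A_tilde` is close to an orthogonal matrix, which is why the de-mixing matrix can be sought under an orthogonality constraint.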

Reconstruction

Reconstruction of $s$. We can look for a de-mixing matrix $W$ such that $W^T W = I_p$ and $y(t) = W x(t)$, solving

\[ \min_{W^T W = I_p} D(W), \]

where $D(W)$ is the dependency among the components.

A. Hyvärinen and E. Oja, Independent component analysis: A tutorial, Neural Networks.

Lie  group  techniques 


Optimizing via gradient flows

Let $M$ be a Riemannian manifold with metric $m(\cdot,\cdot)$. Given a smooth function $\phi : M \to \mathbb{R}$, the equilibria of

\[ \dot{x}(t) = -\operatorname{grad}\phi(x(t)) \]

are the critical points of $\phi$. $\operatorname{grad}\phi$ is such that:

* $\operatorname{grad}\phi(x) \in T_x M$
* $\phi'|_x(v) = m(\operatorname{grad}\phi(x), v)$ for all $v \in T_x M$
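
The two defining properties of the gradient can be checked numerically on a simple example. The sketch below assumes $M$ is the unit sphere in $\mathbb{R}^n$ with the metric induced by the Euclidean inner product and $\phi(x) = x^T A x$ for a symmetric test matrix $A$; these choices, and the function names, are illustrative and not from the slides. In this setting the Riemannian gradient is the Euclidean gradient projected onto the tangent space.

```python
import numpy as np

def phi(A, x):
    return x @ A @ x

def riemannian_grad(A, x):
    """Gradient of phi on the unit sphere: Euclidean gradient projected onto T_x M."""
    euclid = 2.0 * A @ x
    return euclid - (x @ euclid) * x

rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))
A = B + B.T                                 # symmetric test matrix (assumed)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                      # a point on the sphere

g = riemannian_grad(A, x)
v = rng.standard_normal(n)
v -= (x @ v) * x                            # a tangent vector at x

tangency = x @ g                            # property 1: grad phi(x) lies in T_x M
eps = 1e-6
directional = (phi(A, x + eps * v) - phi(A, x - eps * v)) / (2 * eps)
metric_pairing = g @ v                      # property 2: equals the derivative along v

# At an eigenvector of A the gradient vanishes: an equilibrium of the flow.
_, vecs = np.linalg.eigh(A)
grad_at_eigvec = riemannian_grad(A, vecs[:, 0])
```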


Lie  group  techniques  for  Neural  Learning  -  p.12/2< 


U. Helmke and J.B. Moore, Optimization and Dynamical Systems, Springer-Verlag, 1994.

M.T. Chu and K.R. Driessel, The projected gradient method for least squares matrix approximations with spectral constraints, SIAM J. Num. Anal., 1990.

S.I. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, 1998.

Y. Nishimori, Learning algorithm for ICA by geodesic flows on orthogonal

Optimizing via mechanical systems I

Consider $S^* = \{(m_i, w_i)\}$, a rigid system of $n$ masses $m_i$ with positions $w_i$ (at unit distance from the origin on mutually orthogonal axes). The masses move in a viscous liquid. No translation.

\[ \dot{W} = HW, \qquad P = -\mu H W, \]
\[ \dot{H} = \tfrac{1}{4}\left[(F + P) W^T - W (F + P)^T\right] \]

* $\mu$: viscosity parameter
* $P$: matrix of the viscosity resistance
* $W$: matrix of the positions
* $F$: active forces
* $H$: angular velocity matrix

$W$ is on $O(n)$ or on the Stiefel manifold.

Optimizing via mechanical systems II

The mechanical system is seen as an adapting rule for neural layers with weight matrix $W$.

The forces are

\[ F = -\frac{\partial U}{\partial W}, \]

with $U$ a potential energy function. The equilibria of the mechanical system $S^*$ are at the local minima of $U$.

Take $U = J_C$, a cost function to be minimized, or $U = -J_O$, an objective function to be maximized; then $W(t)$, as $t \to \infty$, approaches the solution of the optimization problem.

S. Fiori, 'Mechanical' Neural Learning for Blind Source Separation, Electronics Letters, 1999.

Reformulation of the equations when $n \gg p$

Using the Lie algebra:

\[ \dot{W} = HW, \qquad P = -\mu H W, \]
\[ \dot{H} = \tfrac{1}{4}\left[(F + P) W^T - W (F + P)^T\right] \]

Using the tangent space:

\[ \dot{W} = V, \qquad \dot{V} = g(V, W), \]

where

\[ V = (G W^T - W G^T) W, \qquad G = V - W (W^T V / 2 + S), \]

and

\[ g(V, W) = (L W^T - W L^T) W + (G W^T - W G^T) V, \qquad L = \dot{G} - G W^T G. \]

The learning algorithm

\[
\begin{cases}
V_{n+1} = V_n + h\, g(V_n, W_n) \\
G_n = V_n - \tfrac{1}{2} W_n (W_n^T V_n) \\
W_{n+1} = \exp\!\big(h (G_n W_n^T - W_n G_n^T)\big) W_n
\end{cases}
\]

with $W_0 = I_{n \times p}$ and $V_0 = 0_{n \times p}$.

Here

\[
\exp\!\big(h (G_n W_n^T - W_n G_n^T)\big)
= [W_n, W_n^{\perp}]\,
\exp\!\left( h \begin{bmatrix} C - C^T & -R^T \\ R & O \end{bmatrix} \right)
[W_n, W_n^{\perp}]^T
\]

and $C = W_n^T G_n$, and $G_n - W_n C = W_n^{\perp} R$. We exponentiate matrices of dimension $2p \times 2p$ instead of $n \times n$.

Computational cost

For the exponential: $9np^2 + np + O(p^3)$ flops. For the overall geodesic learning algorithm (one step): $21np^2 + 6np + O(p^3)$ flops.
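
The reduction to a $2p \times 2p$ exponential can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' implementation: $W_n^{\perp} R$ is obtained from a reduced QR factorization of $G_n - W_n C$, the helper `expm_skew` computes the exponential of a skew-symmetric matrix through an eigendecomposition of the Hermitian matrix $iM$ (any matrix exponential routine would do), and the result is compared with the full $n \times n$ exponential applied to $W_n$. The sketch needs $n \ge 2p$.

```python
import numpy as np

def expm_skew(M):
    """Exponential of a real skew-symmetric matrix M, using that iM is Hermitian."""
    mu, U = np.linalg.eigh(1j * M)            # iM = U diag(mu) U^H with real mu
    return (U @ np.diag(np.exp(-1j * mu)) @ U.conj().T).real

def geodesic_step(W, G, h):
    """Return exp(h (G W^T - W G^T)) W using only a 2p x 2p exponential."""
    n, p = W.shape
    C = W.T @ G
    W_perp, R = np.linalg.qr(G - W @ C)       # G - W C = W_perp R
    small = np.block([[C - C.T, -R.T],
                      [R, np.zeros((p, p))]])
    E = expm_skew(h * small)
    # Applying the factored exponential to W selects the first p columns of E.
    return np.hstack([W, W_perp]) @ E[:, :p]

rng = np.random.default_rng(3)
n, p, h = 12, 2, 0.1
W, _ = np.linalg.qr(rng.standard_normal((n, p)))   # a point on the Stiefel manifold
G = rng.standard_normal((n, p))                    # arbitrary test matrix (assumed)

W_next = geodesic_step(W, G, h)
W_full = expm_skew(h * (G @ W.T - W @ G.T)) @ W    # reference: full n x n exponential
```

Both routes give the same update and keep the columns of the new matrix orthonormal; only the size of the exponentiated matrix differs.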

Computational gain

Computing the largest eigenvalue of an $n \times n$ matrix $A$ (discretization of the 1-D Laplacian with finite differences).

The potential energy function is $U(w) = -w^T A w$, $p = 1$.

Size of A    New MEC        Old MEC
n = 32       4.72 x 10^5    1.31 x 10^6
n = 64       1.82 x 10^6    5.25 x 10^6
n = 128      7.39 x 10^6    2.10 x 10^7
n = 256      2.49 x 10^7    8.39 x 10^7

Floating point operations per iteration versus the size of the problem.
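
As a rough illustration of this test problem, the sketch below minimizes $U(w) = -w^T A w$ on the unit sphere ($p = 1$) by following geodesics in the direction of the negative Riemannian gradient. It is a plain first order gradient method, not the MEC equations, and it says nothing about the flop counts in the table; the matrix size, step size and iteration count are assumptions, and the finite-difference matrix is used without its $1/\Delta x^2$ scaling.

```python
import numpy as np

def laplacian_1d(n):
    """Tridiagonal finite-difference matrix tridiag(-1, 2, -1)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def geodesic_descent(A, w, h, steps):
    """Minimize U(w) = -w^T A w on the unit sphere by stepping along great circles."""
    for _ in range(steps):
        grad_U = -2.0 * (A @ w - (w @ A @ w) * w)    # Riemannian gradient of U at w
        norm = np.linalg.norm(grad_U)
        if norm < 1e-12:                             # already at a critical point
            break
        d = -grad_U / norm                           # unit descent direction in T_w M
        w = np.cos(h * norm) * w + np.sin(h * norm) * d
    return w

n = 8
A = laplacian_1d(n)
rng = np.random.default_rng(4)
w0 = rng.standard_normal(n)
w0 /= np.linalg.norm(w0)

w = geodesic_descent(A, w0, h=0.1, steps=3000)
estimate = w @ A @ w                      # Rayleigh quotient at the computed point
exact = np.linalg.eigvalsh(A)[-1]         # largest eigenvalue, for comparison
```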

Experiments: blind source separation

[Figure: original images, with their kurtosis, and their linear mixtures. Kurtosis values shown: 4.981, 2.157, 2.953, 4.699]

Source separation

The force is $F(W) = -k\, E_x[x (x^T W)^3]$.

[Figure: recovered image and potential energy]

[Figure: recovered least-kurtotic images]

Algorithm: MEC3
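
The force on this slide can be estimated from samples and related to a potential through $F = -\partial U / \partial W$. The sketch below assumes the potential $U(W) = \tfrac{k}{4} E_x\big[\sum_j (x^T w_j)^4\big]$, which is one choice consistent with the stated force but is not given on the slide, and it checks one entry of $F$ against a central finite difference of $U$. The data are random test samples, not the images of the experiment.

```python
import numpy as np

def force(W, X, k=1.0):
    """Sample estimate of F(W) = -k E_x[x (x^T W)^3]; the columns of X are samples of x."""
    Y = X.T @ W                               # each row is x^T W for one sample
    return -k * (X @ Y**3) / X.shape[1]

def potential(W, X, k=1.0):
    """Assumed potential U(W) = (k/4) E_x[sum_j (x^T w_j)^4], so that F = -dU/dW."""
    return 0.25 * k * np.mean(np.sum((X.T @ W) ** 4, axis=1))

rng = np.random.default_rng(5)
n, p, N = 4, 2, 500
X = rng.standard_normal((n, N))               # hypothetical data samples
W, _ = np.linalg.qr(rng.standard_normal((n, p)))

F = force(W, X)

# Central finite difference of U with respect to the entry W[i, j].
i, j, eps = 1, 0, 1e-5
dW = np.zeros((n, p))
dW[i, j] = eps
dU = (potential(W + dW, X) - potential(W - dW, X)) / (2 * eps)
```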

References

* E. Celledoni and S. Fiori, Neural learning by Geometric Integration of Reduced 'Rigid-Body' Equations, J. CAM, to appear.
* E. Celledoni and B. Owren, On the implementation of Lie group methods on the Stiefel manifold, Numerical Algorithms, 2003.

Future work

* On the orthogonal group, consider quasi-geodesic paths using low-rank splittings
* Other manifolds occur in the case of multi-layer neural networks: Flag manifolds
* Comparison with Newton methods

Newton methods, Mahony’s approach

For finding minima or maxima of $\phi : G \to \mathbb{R}$, where $G$ is a Lie group:

* choose an inner product $\langle \cdot, \cdot \rangle$ on the Lie algebra $\mathfrak{g}$ and take an orthonormal basis $X_1, \dots, X_d$ of the Lie algebra, with $\tilde{X}_1, \dots, \tilde{X}_d$ the right invariant vector fields
* \[ \operatorname{grad}\phi = \sum_{i=1}^{d} m(\tilde{X}_i, \operatorname{grad}\phi)\, \tilde{X}_i = \sum_{i=1}^{d} (\tilde{X}_i \phi)\, \tilde{X}_i \]
  ($m(X, Y) = \langle X, Y \rangle$, right invariant group metric)
* if $\exp(X)\sigma$ is a critical point of $\phi$, the vector field $\tilde{X}$ satisfies
  \[ \operatorname{grad}\phi(\sigma) + \operatorname{grad}(\tilde{X}\phi)(\sigma) = 0 \]

R. E. Mahony, The constrained Newton method on a Lie group and the symmetric eigenvalue problem, Lin. Alg. and Appl., 1996.

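* Find $X_k$ such that the corresponding right invariant vector field $\tilde{X}_k$ solves
  \[ \operatorname{grad}\phi(\sigma_k) + \operatorname{grad}(\tilde{X}_k \phi)(\sigma_k) = 0, \]
  set $\sigma_{k+1} = \exp(X_k)\sigma_k$, $k \leftarrow k + 1$ and continue (equivalent to Lie-Euler for $\dot{\sigma} = X_k \sigma$, $\sigma(0) = \sigma_k$)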


Newton methods, other approaches

* A. Edelman, T. Arias, S.T. Smith, The geometry of algorithms with orthogonality constraints, SIAM J. Matrix Anal. Newton methods and Conjugate Gradient on the Stiefel and Grassmann manifolds.
* B. Owren and B. Welfert, The Newton iteration on Lie groups, BIT, 2000. Context: implicit Lie group methods; this method can be applied directly in the implicit integration of gradient flows.
* L. Lopez, C. Mastroserio, T. Politi, Newton-type methods for solving nonlinear equations on quadratic matrix groups, J. CAM, 2000. Similar to the previous one, using the Cayley transformation.
* J.P. Dedieu and D. Nowicki, Symplectic methods for the approximation of the exponential and the Newton sequence on Riemannian submanifolds, Preprint, February 2004. General Riemannian manifold, use of tangent space parametrizations, geodesic seen as the trajectory of a free particle attached to the manifold.

Diffusion-type algorithms

Perturbation of the standard Riemannian gradient to obtain a randomized gradient. Diffusion-type gradient on $SO(n)$:

\[ V_{\mathrm{diff}}(t) = V(t) + \sqrt{2\Theta} \sum_{k=1}^{n(n-1)/2} X_k \frac{dW_k}{dt} \]

$V(t)$ is the deterministic gradient, $X_k$ is a basis of the Lie algebra $\mathfrak{so}(n)$ orthogonal with respect to the chosen metric, and $W_k(t)$ are real-valued, independent standard Wiener processes, i.e. a random variable $W$ continuous in $t$ such that

* $W(0) = 0$ with probability 1
* for $0 < r < t$ the random variable $W(t) - W(r)$ is normally distributed with mean zero and variance $t - r$
* for $0 < r < t < u < v$, the increments $W(t) - W(r)$ and $W(v) - W(u)$ are statistically independent
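
The three properties of a standard Wiener process listed above can be checked empirically. The sketch below simulates paths on a uniform time grid by summing independent Gaussian increments; the grid, the number of paths and the chosen time points are assumptions made for illustration.

```python
import numpy as np

def wiener_paths(n_paths, n_steps, t_end, rng):
    """Sample Wiener paths on a uniform grid; column 0 holds W(0) = 0."""
    dt = t_end / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    start = np.zeros((n_paths, 1))
    return np.concatenate([start, np.cumsum(increments, axis=1)], axis=1)

rng = np.random.default_rng(6)
W = wiener_paths(n_paths=20_000, n_steps=100, t_end=1.0, rng=rng)

# Increment over [r, t] = [0.25, 0.75]: mean zero and variance t - r = 0.5.
inc_a = W[:, 75] - W[:, 25]
# A later, non-overlapping increment over [0.75, 1.0].
inc_b = W[:, 100] - W[:, 75]

mean_a = inc_a.mean()
var_a = inc_a.var()
corr_ab = np.corrcoef(inc_a, inc_b)[0, 1]   # near zero for independent increments
```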

The learning differential equation is

\[ \frac{dW}{dt} = -V_{\mathrm{diff}}(t)\, W, \]

a Langevin-type stochastic differential equation on the orthogonal group.

X. Liu, A. Srivastava, K. Gallivan, Optimal linear representation of images for object recognition, IEEE Trans. Pattern Analysis, 2004.

Conclusion

* Integration of learning equations and gradient flows is achieved with simple first order explicit Lie group integrators
* Efficient approximation of the matrix exponential from a Lie algebra to a Lie group, or the computation of geodesics, is crucial
* Development of methods based on coordinate maps other than the exponential, and quasi-geodesic strategies
* Geometric integration of stochastic differential equations